Abstract

Many statistical estimation techniques for high-dimensional or functional
data are based on a preliminary dimension reduction step, which consists
in projecting the sample $X_1, \dots, X_n$ onto the first $D$ eigenvectors of
the Principal Component Analysis (PCA) associated with the empirical
projector $\hat\Pi_D$. Classical nonparametric inference methods such as
kernel density estimation or kernel regression analysis are then performed
in the (usually small) $D$-dimensional space. However, the mathematical
analysis of this data-driven dimension reduction scheme raises technical
problems, because the random variables of the projected sample
$(\hat\Pi_D X_1, \dots, \hat\Pi_D X_n)$ are no longer independent. As a
reference for further studies, we offer in this paper several results showing
the asymptotic equivalencies between important kernel-related quantities
based on the empirical projector and its theoretical counterpart. As an
illustration, we provide an in-depth analysis of the nonparametric kernel
regression case.

Index Terms: Principal Component Analysis, Dimension reduction,
Nonparametric kernel estimation, Density estimation, Regression estimation,
Perturbation method.

AMS 2000 Classification: 62G05, 62G20. 

1 Introduction 

Nonparametric curve estimation provides a useful tool for exploring and
understanding the structure of a data set, especially when parametric models are
inappropriate. A large amount of progress was made in the 1990s in both the
design and the study of inferential aspects of nonparametric estimates. There
are too many references to be included here, but the monographs of Silverman
[25], Scott [21], Simonoff [24] and Györfi et al. [TT] will provide the reader with
good introductions to the general subject area.

Among all the nonparametric methods which have been proposed so far, kernel
estimation has gained favor with many data analysts, probably because of
its ease of implementation and good statistical properties; see for example
Simonoff [21] for a variety of real data examples which illustrate the power of
the approach. Kernel estimates were originally studied in density estimation
by Rosenblatt [19] and Parzen [17], and were later introduced in regression
estimation by Nadaraya [TSl [H] and Watson [27]. A compilation of the
mathematical properties of kernel estimates can be found in Prakasa Rao [T5] (for
density estimation), Györfi et al. [TT] (for regression) and Devroye et al. [5] (for
classification and pattern recognition). To date, most of the results pertaining
to kernel estimation have been reported in the finite-dimensional case, where
it is assumed that the observation space is the standard Euclidean space $\mathbb R^d$.
However, in an increasing number of practical applications, input data items
are in the form of random functions (speech recordings, multiple time series,
images...) rather than standard vectors, and this casts the problem into the
general class of functional data analysis. Motivated by this broad range of
potential applications, Ferraty and Vieu describe in [5] a possible route to extend
kernel estimation to potentially infinite-dimensional spaces.

On the other hand, it has become increasingly clear over the years that the 
performances of kernel estimates deteriorate as the dimension of the problem 
increases. The reason for this is that, in high dimensions, local neighborhoods 
tend to be empty of sample observations unless the sample size is very large. 
Thus, in kernel estimation, there will be no local averages to take unless the 
bandwidth is very large. This general problem was termed the curse of
dimensionality (Bellman [T]) and, in fact, practical and theoretical arguments suggest
that kernel estimation beyond 5 dimensions is fruitless. The paper by Scott
and Wand [22] gives a good account of the feasibility and difficulties of
high-dimensional estimation, with examples and computations.
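
The emptiness of local neighborhoods can be made concrete with a short computation. The sketch below is only an illustration under assumptions that are not in the text: the observations are taken uniform on the cube $[-1, 1]^d$, the neighborhood is the Euclidean ball of radius $h = 0.5$ around the origin, and the function name `ball_fraction` is hypothetical.

```python
import math

def ball_fraction(d, h):
    """Probability that a point uniform on the cube [-1, 1]^d falls in the
    Euclidean ball of radius h <= 1 centred at the origin."""
    # Volume of the d-dimensional ball of radius h, divided by the cube volume 2^d.
    ball_volume = math.pi ** (d / 2) * h ** d / math.gamma(d / 2 + 1)
    return ball_volume / 2.0 ** d

if __name__ == "__main__":
    n, h = 1000, 0.5
    for d in (1, 2, 5, 10):
        p = ball_fraction(d, h)
        print(f"d={d:2d}  P(neighbour)={p:.2e}  expected neighbours={n * p:.3f}")
```

With a sample of size 1000, the expected number of observations in the ball drops from several hundred in dimension 1 to far less than one in dimension 10, which is the phenomenon described above.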

In order to circumvent the high-dimension difficulty and make kernel estimation
simpler, a wide range of techniques have been developed. One of the most
common approaches is a two-stage strategy: first reduce the dimension of the
data and then perform (density or regression) kernel estimation. In this
respect, a natural way to reduce dimension is to extract the largest $D$ principal
component axes (with $D$ chosen to account for most of the variation in the data),
and then operate in this $D$-dimensional space, thereby improving the ability to
discover interesting structures (Jee [12], Friedman [9] and Scott [21], Chapter
7). To illustrate this mechanism more formally, let
$(\mathcal F, \langle\cdot,\cdot\rangle, \|\cdot\|)$ be a (typically
high or infinite-dimensional) separable Hilbert space, and consider for example
the regression problem, where we observe a set
$\mathcal D_n = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$
of independent $\mathcal F \times \mathbb R$-valued random variables with the same
distribution as a generic pair $(X, Y)$ satisfying $\mathbb E|Y| < \infty$. The goal is
to estimate the regression function $r(x) = \mathbb E[Y \mid X = x]$ using the data
$\mathcal D_n$. The kernel estimate of the function $r$ takes the form

$$r_n(x) = \frac{\sum_{i=1}^n Y_i\, K\left(\|x - X_i\|/h_n\right)}{\sum_{i=1}^n K\left(\|x - X_i\|/h_n\right)}$$

if the denominator is nonzero, and $0$ otherwise. Here the bandwidth $h_n > 0$
depends only on the sample size $n$, and the function $K : [0,\infty) \to [0,\infty)$ is called
a kernel. Usually, $K(v)$ is "large" if $v$ is "small", and the kernel estimate is
therefore a local averaging estimate. Typical choices for $K$ are the naive kernel
$K(v) = \mathbf 1_{[0,1]}(v)$, the Epanechnikov kernel $K(v) = (1 - v^2)_+$ and the Gaussian
kernel $K(v) = \exp(-v^2/2)$.
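
To fix ideas, the local averaging estimate can be sketched in a few lines for finite-dimensional observations. This is a minimal illustration, not the authors' code: the function names are hypothetical, observations are plain tuples, and the Epanechnikov and Gaussian kernels are taken in their usual forms $(1 - v^2)_+$ and $\exp(-v^2/2)$, which is an assumption about the intended definitions.

```python
import math

def naive_kernel(v):
    # Indicator of [0, 1].
    return 1.0 if 0.0 <= v <= 1.0 else 0.0

def epanechnikov_kernel(v):
    # (1 - v^2)_+ : assumed standard form.
    return max(1.0 - v * v, 0.0)

def gaussian_kernel(v):
    # exp(-v^2 / 2) : assumed standard form.
    return math.exp(-v * v / 2.0)

def kernel_regression(x, xs, ys, h, kernel=naive_kernel):
    """Weighted average of the responses ys, with weights K(||x - X_i|| / h).

    Returns 0 when all weights vanish, following the convention of the text.
    """
    weights = [kernel(math.dist(x, xi) / h) for xi in xs]
    denom = sum(weights)
    if denom == 0.0:
        return 0.0
    return sum(w * y for w, y in zip(weights, ys)) / denom
```

For instance, with the naive kernel and $h = 1$, only the observations within distance 1 of $x$ contribute, and the estimate is their plain average.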

As explained earlier, the estimate $r_n$ is prone to the curse of dimensionality,
and the strategy advocated here is to first reduce the ambient dimension by
the use of Principal Component Analysis (PCA, see for example Dauxois et
al. [5] and Jolliffe [13]). More precisely, assume without loss of generality
that $\mathbb E X = 0$, $\mathbb E\|X\|^2 < \infty$, and let
$\Gamma(\cdot) = \mathbb E[\langle X, \cdot\rangle X]$ be the covariance
operator of $X$ and $\Pi_D$ be the orthogonal projector on the collection of the
first $D$ eigenvectors $\{e_1, \dots, e_D\}$ of $\Gamma$ associated with the first $D$ eigenvalues
$\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_D > 0$. In the sequel we will assume as well that the
distribution of $X$ is nonatomic.



In this context, the PCA-kernel regression estimate reads

$$r_n^D(x) = \frac{\sum_{i=1}^n Y_i\, K\left(\|\Pi_D(x - X_i)\|/h_n\right)}{\sum_{i=1}^n K\left(\|\Pi_D(x - X_i)\|/h_n\right)}.$$

The hope here is that the most informative part of the distribution of $X$ should
be preserved by projecting the observations on the first $D$ principal component
axes, so that the estimate should still do a good job at estimating $r$ while
operating in a reduced-dimensional space. Alas, on the practical side, the
smoother $r_n^D$ is useless since the distribution of $X$ (and thus, the projector $\Pi_D$)
is usually unknown, making $r_n^D$ what is called a "pseudo-estimate". However,
the covariance operator $\Gamma$ can be approximated by its empirical version

$$\Gamma_n(\cdot) = \frac{1}{n} \sum_{i=1}^n \langle X_i, \cdot\rangle X_i \qquad (1.1)$$

and $\Pi_D$ is in turn approximated by the empirical orthogonal projector $\hat\Pi_D$ on
the (empirical) eigenvectors $\{\hat e_1, \dots, \hat e_D\}$ of $\Gamma_n$. Thus, the operational version
$\hat r_n^D$ of the pseudo-estimate takes the form

$$\hat r_n^D(x) = \frac{\sum_{i=1}^n Y_i\, K\left(\|\hat\Pi_D(x - X_i)\|/h_n\right)}{\sum_{i=1}^n K\left(\|\hat\Pi_D(x - X_i)\|/h_n\right)}.$$
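
The two-stage procedure just described (empirical covariance, empirical projector, kernel smoothing of the projected distances) may be sketched as follows for observations in $\mathbb R^d$. This is a minimal sketch under assumptions: the data are stored as an `(n, d)` NumPy array and are already centered, the naive kernel is used, and the names `empirical_projector` and `pca_kernel_regression` are hypothetical.

```python
import numpy as np

def empirical_projector(X, D):
    """Orthogonal projector onto the top-D eigenvectors of
    Gamma_n = (1/n) * sum_i X_i X_i^T, for centered rows X_i."""
    n = X.shape[0]
    gamma_n = X.T @ X / n
    _, eigvecs = np.linalg.eigh(gamma_n)   # eigenvalues returned in ascending order
    top = eigvecs[:, -D:]                   # columns for the D largest eigenvalues
    return top @ top.T

def pca_kernel_regression(x, X, Y, D, h):
    """Operational PCA-kernel regression estimate with the naive kernel 1_[0,1]."""
    P = empirical_projector(X, D)
    # Row i of (x - X) @ P is the projection of x - X_i (P is symmetric).
    v = np.linalg.norm((x - X) @ P, axis=1) / h
    w = (v <= 1.0).astype(float)
    s = w.sum()
    return float(w @ Y / s) if s > 0 else 0.0
```

Note that the same sample is used twice, once to build the projector and once to smooth, which is exactly the source of the dependence discussed next.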

Unfortunately, from a mathematical point of view, computations involving the
numerator or the denominator of the estimate $\hat r_n^D$ are difficult, since the random
variables $(K(\|\hat\Pi_D(x - X_i)\|/h_n))_{1 \le i \le n}$ are identically distributed but clearly
not independent. Besides, due to nonlinearity, the distribution of $\|\hat\Pi_D(x - X_i)\|$
is usually inaccessible, even when the $X_i$'s have known and simple distributions.
In short, this makes any theoretical calculation impossible, and it essentially
explains why so few theoretical results have been reported so far on the statistical
properties of the estimate $\hat r_n^D$, despite its wide use. On the other hand, we
note that the random variables $(K(\|\Pi_D(x - X_i)\|/h_n))_{1 \le i \le n}$ are independent
and identically distributed. Therefore, the pseudo-estimate $r_n^D$ is amenable to
mathematical analysis, and fundamental asymptotic theorems such as the law
of large numbers and the central limit theorem may be applied.

In the present contribution, we prove that $\hat r_n^D$ and $r_n^D$ have the same asymptotic
behavior and show that nothing is lost in terms of rates of convergence when
replacing $\hat r_n^D$ by $r_n^D$ (Section 4). In fact, taking a more general view, we offer in
Section 3 a thorough asymptotic comparison of the partial sums

$$\hat S_n(x) = \sum_{i=1}^n K\left(\frac{\|\hat\Pi_D(x - X_i)\|}{h_n}\right) \quad\text{and}\quad S_n(x) = \sum_{i=1}^n K\left(\frac{\|\Pi_D(x - X_i)\|}{h_n}\right),$$

with important consequences in kernel density estimation. As an appetizer,
we will first carry out in Section 2 a preliminary analysis of the asymptotic
proximity between the projection operators $\Pi_D$ and $\hat\Pi_D$. Our approach will
strongly rely on the representation of the operators by Cauchy integrals, through
what is classically known in analysis as the perturbation method. For the sake of
clarity, proofs of the most technical results are postponed to Section 5.

2 Asymptotics for PCA projectors 

Here and in the sequel, we let $(\mathcal F, \langle\cdot,\cdot\rangle, \|\cdot\|)$ be a separable Hilbert space and
$X_1, \dots, X_n$ be independent random variables, distributed as a generic nonatomic
and centered random variable $X$ satisfying $\mathbb E\|X\|^2 < \infty$. Denoting by $\Gamma$ the
covariance operator of $X$, we let $\Pi_D$ be the orthogonal projection operator on
$\{e_1, \dots, e_D\}$, the set of first $D$ eigenvectors of $\Gamma$ associated with the (nonnegative)
eigenvalues $\{\lambda_1, \dots, \lambda_D\}$ sorted by decreasing order. The empirical version
$\Gamma_n$ of $\Gamma$ is defined in (1.1), and we denote by $\{\hat e_1, \dots, \hat e_D\}$ and
$\{\hat\lambda_1, \dots, \hat\lambda_D\}$
the associated empirical eigenvector and (nonnegative) eigenvalue sets, respectively,
based on the sample $X_1, \dots, X_n$. To keep things simple, we will assume
throughout that the projection dimension $D$ is fixed and independent of the
observations (for data-dependent methods regarding the choice of $D$, see for
example Jolliffe [13]). Besides, and without loss of generality, it will also be
assumed that $\lambda_1 > \dots > \lambda_{D+1}$. This assumption may be removed at the
expense of more tedious calculations taking into account the dimension of the
eigenspaces (see for instance [14] for a generic method).

The aim of this section is to derive new asymptotic results regarding the
empirical projector $\hat\Pi_D$ on $\{\hat e_1, \dots, \hat e_D\}$ as the sample size $n$ grows to infinity. Let
us first recall some elementary facts from complex analysis. The eigenvalues
$\{\lambda_1, \dots, \lambda_D\}$ are nonnegative real numbers, but we may view them as points
in the complex plane $\mathbb C$. Denote by $\mathcal C$ a closed oriented contour in $\mathbb C$, that is a
closed curve (for instance, the boundary of a rectangle) endowed with a
circulation. Suppose first that $\mathcal C = \mathcal C_1$ contains $\lambda_1$ only. Then, the so-called formula
of residues (Rudin [30]) asserts that

$$\int_{\mathcal C_1} \frac{dz}{z - \lambda_1} = 1 \quad\text{and}\quad \int_{\mathcal C_1} \frac{dz}{z - \lambda_i} = 0 \quad\text{for } i \ne 1.$$

In fact, this formula may be generalized to functional calculus for operators.
We refer for instance to Dunford and Schwartz [7] or Gohberg et al. [TU] for
exhaustive information about this theory, which allows one to derive integration
formulae for functions with operator values, such as

$$\Pi_1 = \int_{\mathcal C_1} (zI - \Gamma)^{-1}\, dz.$$

Thus, in this formalism, the projector on $e_1$ is explicitly written as a function
of the covariance operator. Clearly, the same arguments allow us to express the
empirical projector $\hat\Pi_1$ as

$$\hat\Pi_1 = \int_{\hat{\mathcal C}_1} (zI - \Gamma_n)^{-1}\, dz,$$

where $\hat{\mathcal C}_1$ is a (random) contour which contains $\hat\lambda_1$ and no other eigenvalue of
$\Gamma_n$. These formulae generalize and, letting $\mathcal C_D$ (respectively $\hat{\mathcal C}_D$) be contours
containing $\{\lambda_1, \dots, \lambda_D\}$ (respectively $\{\hat\lambda_1, \dots, \hat\lambda_D\}$) only, we may write

$$\Pi_D = \int_{\mathcal C_D} (zI - \Gamma)^{-1}\, dz \quad\text{and}\quad \hat\Pi_D = \int_{\hat{\mathcal C}_D} (zI - \Gamma_n)^{-1}\, dz.$$
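
The contour representation of the projector can be checked numerically in finite dimension. The sketch below is an illustration only: it uses a symmetric matrix in place of the operator $\Gamma$, a rectangular contour travelled counterclockwise, a plain midpoint rule, and it makes explicit the usual normalising factor $1/(2\pi i)$ of complex analysis, which the integral formulae above leave implicit. The function name `riesz_projector` and its parameters are hypothetical.

```python
import numpy as np

def riesz_projector(gamma, left, right, height=1.0, m=1000):
    """Approximate (1 / (2*pi*i)) * integral of (zI - gamma)^{-1} dz over the
    rectangle [left, right] x [-height, height], oriented counterclockwise."""
    corners = [complex(left, -height), complex(right, -height),
               complex(right, height), complex(left, height)]
    d = gamma.shape[0]
    eye = np.eye(d)
    total = np.zeros((d, d), dtype=complex)
    t = (np.arange(m) + 0.5) / m                      # midpoints of m sub-intervals
    for a, b in zip(corners, corners[1:] + corners[:1]):
        for z in a + t * (b - a):                     # points on the side from a to b
            total += np.linalg.inv(z * eye - gamma) * (b - a) / m
    return (total / (2j * np.pi)).real                # real for a real symmetric gamma

if __name__ == "__main__":
    lam = np.array([3.0, 2.0, 0.5])
    delta = (lam[1] - lam[2]) / 2                     # delta_D with D = 2
    print(np.round(riesz_projector(np.diag(lam), lam[1] - delta, lam[0] + 0.5), 4))
```

When the rectangle encloses the two largest eigenvalues and excludes the third, the result agrees, up to quadrature error, with the orthogonal projector on the first two eigenvectors.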

The contours $\mathcal C_D$ may take different forms. However, to keep things simple,
we let in the sequel $\mathcal C_D$ be the boundary of a rectangle as in Figure 1, with a
right vertex intercepting the real line at $x = \lambda_1 + 1/2$ and a left vertex passing
through $x = \lambda_D - \delta_D$, with

$$\delta_D = \frac{\lambda_D - \lambda_{D+1}}{2}.$$

With a slight abuse of notation, we will also denote by $\mathcal C_D$ the corresponding
rectangle.

Thus, with this choice, $\mathcal C_D$ contains $\{\lambda_1, \dots, \lambda_D\}$ and no other eigenvalue. Lemma
2.1 below, which is proved in Section 5, shows that, asymptotically, this assertion
is also true with $\{\hat\lambda_1, \dots, \hat\lambda_D\}$ in place of $\{\lambda_1, \dots, \lambda_D\}$. In the sequel, the letter
$C$ will denote a positive constant, the value of which may vary from line to line.
Moreover, the notation $\|\cdot\|_\infty$ and $\|\cdot\|_2$ will stand for the classical operator and
Hilbert-Schmidt norms, which are respectively defined by

$$\|T\|_\infty = \sup_{x \in B_1} \|Tx\| \quad\text{and}\quad \|T\|_2^2 = \sum_{p=1}^{\infty} \|T u_p\|^2,$$

where $B_1$ denotes the closed unit ball of $\mathcal F$ and $(u_p)_{p \ge 1}$ a Hilbertian basis of
$\mathcal F$. It is known (Dunford and Schwartz [7]) that the value of $\|T\|_2$ does not
depend on the actual basis and that $\|\cdot\|_\infty \le \|\cdot\|_2$. The Hilbert-Schmidt norm is
more commonly used, essentially because it yields simpler calculations than
the sup-norm.

Figure 1: Oriented rectangle-contour $\mathcal C_D$, with a right vertex intercepting the
real line at $x = \lambda_1 + 1/2$ and a left vertex passing through $x = \lambda_D - \delta_D$,
$\delta_D = (\lambda_D - \lambda_{D+1})/2$.

As promised, the next lemma ensures that the empirical eigenvalues are located
in the rectangle $\mathcal C_D$ through an exponential concentration inequality.

Lemma 2.1 For all $n \ge 1$, let the event

$$\mathcal A_n = \left\{\hat\lambda_i \in \mathcal C_D,\ i = 1, \dots, D,\ \text{and } \hat\lambda_{D+1} \notin \mathcal C_D\right\}.$$

There exists a positive constant C such that 

Note that the constants involved in the document depend on the actual
dimension $D$ and their values increase as $D$ becomes large. To circumvent this
difficulty, a possible approach is to let $D$ depend on $n$. This is beyond the scope
of the present paper, and we refer to Cardot et al. [4] for some perspectives in
this direction.

We are now in a position to state the main result of the section. Theorem
2.1 below states the asymptotic proximity of the operators $\hat\Pi_D$ and $\Pi_D$, as
$n$ becomes large, with respect to different proximity criteria. We will make
repeated use of this result throughout the document. We believe however that
it is interesting in itself. For a sequence of random variables $(Z_n)_{n \ge 1}$ and a
positive sequence $(v_n)_{n \ge 1}$, the notation $Z_n = O(v_n)$ a.s. means that each random
draw of $Z_n$ is $O(v_n)$.

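Theorem 2.1 The following three assertions are true for all $n \ge 1$:
(i) There exists a positive constant $C$ such that, for all $\varepsilon > 0$,

$$\mathbb P\left(\|\hat\Pi_D - \Pi_D\|_\infty \ge \varepsilon\right) = O\left(\exp(-Cn\varepsilon^2)\right).$$

(ii) One has

$$\mathbb E\,\|\hat\Pi_D - \Pi_D\|_\infty^2 = O\left(\frac{1}{n}\right).$$

(iii) One has

$$\|\hat\Pi_D - \Pi_D\|_\infty = O\left(\sqrt{\frac{\log n}{n}}\right) \quad\text{a.s.}$$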
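
A small simulation gives a feeling for the proximity between the empirical and the theoretical projectors. It is only a sketch under assumptions that are not in the text: the observations are centered Gaussian vectors in $\mathbb R^5$ with a diagonal covariance (chosen for convenience), $D = 2$, and the helper names are hypothetical. It reports the operator norm and the Hilbert-Schmidt norm of $\hat\Pi_D - \Pi_D$ for several sample sizes.

```python
import numpy as np

def top_projector(gamma, D):
    """Orthogonal projector on the eigenvectors of the D largest eigenvalues."""
    _, vecs = np.linalg.eigh(gamma)        # ascending eigenvalues
    top = vecs[:, -D:]
    return top @ top.T

def projector_errors(n, eigenvalues, D, rng):
    """Operator and Hilbert-Schmidt norms of hat Pi_D - Pi_D for one sample of size n."""
    lam = np.asarray(eigenvalues, dtype=float)
    X = rng.standard_normal((n, lam.size)) * np.sqrt(lam)   # covariance diag(lam)
    diff = top_projector(X.T @ X / n, D) - top_projector(np.diag(lam), D)
    return np.linalg.norm(diff, 2), np.linalg.norm(diff, "fro")

rng = np.random.default_rng(0)
lam = [4.0, 2.0, 0.5, 0.25, 0.1]
for n in (200, 2000, 20000):
    errs = np.array([projector_errors(n, lam, 2, rng) for _ in range(20)])
    print(n, errs.mean(axis=0), "sqrt(n) * mean operator norm:",
          np.sqrt(n) * errs[:, 0].mean())
```

In such runs one would expect the rescaled quantity $\sqrt n\,\|\hat\Pi_D - \Pi_D\|_\infty$ to remain of the same order as $n$ grows, in line with a $n^{-1/2}$ rate, and the operator norm never to exceed the Hilbert-Schmidt norm.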



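Proof of Theorem 2.1 The proof will be based on arguments presented in
Mas and Menneteau [TJ. Using the notation of Lemma 2.1, we start from the
decomposition

$$\hat\Pi_D - \Pi_D = (\hat\Pi_D - \Pi_D)\,\mathbf 1_{\mathcal A_n^c} + (\hat\Pi_D - \Pi_D)\,\mathbf 1_{\mathcal A_n}.$$

Consequently,

$$\mathbb P\left(\|\hat\Pi_D - \Pi_D\|_\infty \ge \varepsilon\right) \le \mathbb P\left(\|\hat\Pi_D - \Pi_D\|_\infty \mathbf 1_{\mathcal A_n^c} \ge \varepsilon/2\right) + \mathbb P\left(\|\hat\Pi_D - \Pi_D\|_\infty \mathbf 1_{\mathcal A_n} \ge \varepsilon/2\right). \qquad (2.1)$$

Observing that

$$\|\hat\Pi_D - \Pi_D\|_\infty \mathbf 1_{\mathcal A_n^c} \le 2\,\mathbf 1_{\mathcal A_n^c},$$

we conclude by Lemma 2.1 that

$$\mathbb P\left(\|\hat\Pi_D - \Pi_D\|_\infty \mathbf 1_{\mathcal A_n^c} \ge \varepsilon/2\right) \le \mathbb P(\mathcal A_n^c) = O\left(\exp(-nC\varepsilon^2)\right).$$

With respect to the second term in (2.1), write

$$(\hat\Pi_D - \Pi_D)\,\mathbf 1_{\mathcal A_n} = \mathbf 1_{\mathcal A_n} \int_{\mathcal C_D} \left[(zI - \Gamma_n)^{-1} - (zI - \Gamma)^{-1}\right] dz = \mathbf 1_{\mathcal A_n} \int_{\mathcal C_D} (zI - \Gamma_n)^{-1} (\Gamma_n - \Gamma) (zI - \Gamma)^{-1}\, dz. \qquad (2.2)$$

Let $\ell_D$ be the length of the contour $\mathcal C_D$. Using elementary properties of Riesz
integrals (Gohberg et al. [TO]), we obtain

$$\|\hat\Pi_D - \Pi_D\|_\infty \mathbf 1_{\mathcal A_n} \le \ell_D\, \|\Gamma_n - \Gamma\|_\infty \sup_{z \in \mathcal C_D} \left[\|(zI - \Gamma_n)^{-1}\|_\infty\, \|(zI - \Gamma)^{-1}\|_\infty\right] \mathbf 1_{\mathcal A_n}.$$

Observing that the eigenvalues of the symmetric operator $(zI - \Gamma)^{-1}$ are the

$$\left\{(z - \lambda_i)^{-1},\ i \in \mathbb N^\star\right\},$$

we see that $\|(zI - \Gamma)^{-1}\|_\infty = O(\delta_D^{-1})$ uniformly over $z \in \mathcal C_D$. The same bound
is valid taking $\Gamma_n$ instead of $\Gamma$, when $\mathcal A_n$ holds. In consequence,

$$\|\hat\Pi_D - \Pi_D\|_\infty \mathbf 1_{\mathcal A_n} = O\left(\|\Gamma_n - \Gamma\|_\infty\right). \qquad (2.3)$$

The conclusion follows from the inequalities (2.1)-(2.2)-(2.3), the inequality
$\|\Gamma_n - \Gamma\|_\infty \le \|\Gamma_n - \Gamma\|_2$ and the asymptotic properties of the sequence
$(\Gamma_n - \Gamma)_{n \ge 1}$ (Bosq [3], Chapter 4). ■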



3 Some asymptotic equivalencies 

From now on, we assume $D \ge 2$ and let

$$\hat S_n(x) = \sum_{i=1}^n K\left(\frac{\|\hat\Pi_D(x - X_i)\|}{h_n}\right) \quad\text{and}\quad S_n(x) = \sum_{i=1}^n K\left(\frac{\|\Pi_D(x - X_i)\|}{h_n}\right).$$

We note that $S_n(x)$ is a sum of independent and identically distributed random
variables, whereas the terms in $\hat S_n(x)$ have the same distribution but are not
independent. In light of the results of Section 2, our goal in this section will be
to analyse the asymptotic proximity between $\hat S_n(x)$ and $S_n(x)$ under general
conditions on $K$ and the sequence $(h_n)_{n \ge 1}$. Throughout, we will assume that
the kernel $K$ satisfies the following set of conditions:



Assumption Set K

(K1) $K$ is positive and bounded with compact support $[0, 1]$.
(K2) $K$ is of class $C^1$ on $[0, 1]$.

These assumptions are typically satisfied by the naive kernel $K(v) = \mathbf 1_{[0,1]}(v)$. In
fact, all the subsequent results also hold for kernels with an unbounded support,
provided $K$ is Lipschitz; we leave to the reader the opportunity to check the
details and adapt the proofs, which turn out to be simpler in this case. For any
integer $p \ge 1$, we set

$$M_{D,p} = D \int_0^1 v^{D-1} K^p(v)\, dv$$

and, for all $x \in \mathcal F$ and $h > 0$, we let

$$F_x(h) = \mathbb P\left(\Pi_D X \in B_D(\Pi_D x, h)\right),$$

where $B_D(u, h)$ denotes the closed Euclidean ball of dimension $D$ centered at
$u$ and of radius $h$. In the subsequent developments, to lighten notation a bit,
and since no confusion is possible, we will write $F(h)$ instead of $F_x(h)$. Observe
that $F(h)$ is positive for $\mu$-almost all $x \in \mathcal F$, where $\mu$ is the distribution of $X$.
Besides, by decreasing monotonicity, since $X$ is nonatomic, we have

$$\lim_{h \downarrow 0} F(h) = 0.$$

When the projected random variable $\Pi_D X$ has a density $f$ with respect to the
Lebesgue measure $\lambda$ on $\mathbb R^D$, then $F(h) \sim \gamma_D f(x) h^D$ as $h \to 0$, where $\gamma_D$ is a
positive constant, for $\lambda$-almost all $x$ (see for instance Wheeden and Zygmund
[28]). Thus, in this case, the function $F$ is regularly varying with index $D$. We
generalise this property below.



Assumption Set R
(R1) $F$ is regularly varying at $0$ with index $D$.
Assumption R1 means that, for any $u > 0$,

$$\lim_{s \to 0^+} \frac{F(su)}{F(s)} = u^D.$$

The index of regular variation was fixed to $D$ in order to alleviate the notation,
but the reader should note that our results hold for any positive index,
with different constants however. In fact this index is directly connected with
the support of the distribution of $X$. To see this, observe that by fixing the
index to $D$ we implicitly assume that $\Pi_D X$ fills the whole space of dimension
$D$. However, elementary calculations show that most distributions in $\mathbb R^D$,
when concentrated on a subspace of smaller dimension $D' < D$, will match
assumption R1 with $D'$ instead of $D$. Moreover, representation theorems for
regularly varying functions (see Bingham et al. [5]) show that, under R1, $F$
may be rewritten as $F(u) = u^D L(u)$, where the function $L$ is slowly varying at
$0$, that is $\lim_{s \to 0^+} L(su)/L(s) = 1$. This makes it possible to consider functions $F$ with
non-polynomial behaviour such as, for instance, $F(u) \sim C u^D |\ln u|$ as $u \to 0^+$.
Observe also that $F(u)$ is negligible with respect to $u$ as soon as $D \ge 2$.
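
Regular variation at $0$ can be observed on a case where the small-ball probability is available in closed form. The sketch below assumes, purely for illustration, that $\Pi_D X$ is a standard Gaussian vector in $\mathbb R^2$ and that $x = 0$, so that $F(h) = 1 - e^{-h^2/2}$ and the index is $D = 2$; the function names are hypothetical.

```python
import math

def F(h):
    """P(||Z|| <= h) for Z standard Gaussian in R^2, i.e. 1 - exp(-h^2 / 2)."""
    return -math.expm1(-h * h / 2.0)      # expm1 keeps precision for small h

def ratio(s, u):
    """F(s * u) / F(s), which should approach u^2 as s decreases to 0."""
    return F(s * u) / F(s)

if __name__ == "__main__":
    for s in (1.0, 0.1, 0.01, 0.001):
        print(s, [round(ratio(s, u), 6) for u in (0.5, 2.0, 3.0)])
```

For $s = 1$ the ratios are still far from $u^2$, whereas for small $s$ they settle at $0.25$, $4$ and $9$, as Assumption R1 requires with $D = 2$.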



We start the analysis with two technical lemmas. The proof of Lemma 3.1 is deferred
to Section 5, whereas Lemma 3.2 is an immediate consequence of Lemma 3.1
and Bennett's inequality. Its proof is therefore omitted.

Lemma 3.1 Assume that Assumption Sets K and R are satisfied. Then, for
$\mu$-almost all $x$, if $h_n \downarrow 0$,



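Lemma 3.2 Assume that Assumption Sets K and R are satisfied. Then, for
$\mu$-almost all $x$, if $h_n \downarrow 0$ and $nF(h_n)/\ln n \to \infty$,

$$S_n(x) \sim M_{D,1}\, n F(h_n) \quad\text{a.s.}$$

and

$$\mathbb E S_n^2(x) \sim \left[M_{D,1}\, n F(h_n)\right]^2 \quad\text{as } n \to \infty.$$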
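
The constant $M_{D,p}$ and the first-order behaviour of $\mathbb E K(\|\Pi_D(x - X)\|/h_n)$ obtained at the end of Section 5.2, namely $F(h_n)\, D \int_0^1 s^{D-1} K(s)\, ds$, can be checked by quadrature on a toy case. The sketch assumes, for illustration only, that $\Pi_D X$ is standard Gaussian in $\mathbb R^2$, that $x = 0$, and that $K(v) = 1 - v^2$ on $[0, 1]$ (a hypothetical test kernel satisfying Assumption Set K); all function names are hypothetical.

```python
import math

def midpoint(f, a, b, m=20000):
    """Composite midpoint rule for the integral of f over [a, b]."""
    step = (b - a) / m
    return step * sum(f(a + (k + 0.5) * step) for k in range(m))

def M(D, p, K):
    """M_{D,p} = D * integral over [0, 1] of v^(D-1) * K(v)^p."""
    return D * midpoint(lambda v: v ** (D - 1) * K(v) ** p, 0.0, 1.0)

def K(v):
    """Test kernel 1 - v^2 on [0, 1], zero elsewhere (illustrative choice)."""
    return 1.0 - v * v if 0.0 <= v <= 1.0 else 0.0

def F(h):
    """Small-ball probability of a standard Gaussian vector in R^2 at the origin."""
    return -math.expm1(-h * h / 2.0)

def expected_kernel(h):
    """E K(||Z|| / h); ||Z|| has density v * exp(-v^2 / 2) on [0, infinity)."""
    return midpoint(lambda v: K(v / h) * v * math.exp(-v * v / 2.0), 0.0, h)

if __name__ == "__main__":
    for h in (1.0, 0.3, 0.1, 0.01):
        print(h, expected_kernel(h) / F(h), "limit M_{2,1} =", M(2, 1, K))
```

Here $M_{2,1} = 2\int_0^1 v(1 - v^2)\,dv = 1/2$, and the ratio $\mathbb E K(\cdot)/F(h)$ approaches this value as $h$ decreases. For the naive kernel, $M_{D,p} = 1$ for every $D$ and $p$.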

The following proposition is the cornerstone of this section. It asserts that,
asymptotically, the partial sums $\hat S_n(x)$ and $S_n(x)$ behave similarly.

Proposition 3.1 Assume that Assumption Sets K and R are satisfied and that
$X$ has bounded support. Then, for $\mu$-almost all $x$, if $h_n \downarrow 0$ and
$nF(h_n)/\ln n \to \infty$,

$$\hat S_n(x) \sim S_n(x) \quad\text{a.s.}$$

and

$$\mathbb E \hat S_n^2(x) \sim \mathbb E S_n^2(x) \quad\text{as } n \to \infty.$$
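
The closeness of the two partial sums can be observed on simulated data. The following sketch relies on assumptions that are not in the text: the coordinates of $X$ are independent and uniform on symmetric intervals (so that $X$ is centered with bounded support and a diagonal covariance), $D = 2$, $x = 0$, the naive kernel is used and the bandwidth is held fixed at $h = 0.5$. The function names are hypothetical.

```python
import numpy as np

def projector(gamma, D):
    """Orthogonal projector on the eigenvectors of the D largest eigenvalues."""
    _, vecs = np.linalg.eigh(gamma)
    top = vecs[:, -D:]
    return top @ top.T

def partial_sums(X, x, D, h, true_projector):
    """Return (hat S_n(x), S_n(x)) for the naive kernel K = 1_[0,1]."""
    n = X.shape[0]
    hat_P = projector(X.T @ X / n, D)                       # empirical projector
    hat_V = np.linalg.norm((x - X) @ hat_P, axis=1)
    V = np.linalg.norm((x - X) @ true_projector, axis=1)
    return int((hat_V <= h).sum()), int((V <= h).sum())

rng = np.random.default_rng(0)
half_widths = np.array([3.0, 2.0, 1.0, 0.5])    # X_j uniform on [-a_j, a_j]
true_P = np.diag([1.0, 1.0, 0.0, 0.0])          # top-2 eigenspace of diag(a_j^2 / 3)
x = np.zeros(4)
for n in (500, 5000, 50000):
    X = rng.uniform(-half_widths, half_widths, size=(n, 4))
    hat_S, S = partial_sums(X, x, D=2, h=0.5, true_projector=true_P)
    print(n, hat_S, S, hat_S / S)
```

In such experiments the ratio $\hat S_n(x)/S_n(x)$ is expected to be close to one for large $n$, although a simulation at fixed bandwidth only illustrates the proposition and does not reproduce its asymptotic regime.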
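Proof of Proposition 3.1 To simplify notation a bit, we let, for $i = 1, \dots, n$,
$\hat V_i = \|\hat\Pi_D(x - X_i)\|$ and $V_i = \|\Pi_D(x - X_i)\|$. Let the events $\hat{\mathcal E}_i$ and $\mathcal E_i$ be defined
by

$$\hat{\mathcal E}_i = \{\hat V_i \le h_n\} \quad\text{and}\quad \mathcal E_i = \{V_i \le h_n\}.$$

Clearly,

$$\begin{aligned}
\hat S_n(x) - S_n(x) &= \sum_{i=1}^n K(\hat V_i/h_n) - \sum_{i=1}^n K(V_i/h_n) \\
&= \sum_{i=1}^n \left[K(\hat V_i/h_n) - K(V_i/h_n)\right] \mathbf 1_{\hat{\mathcal E}_i \cap \mathcal E_i}
 + \sum_{i=1}^n K(\hat V_i/h_n)\, \mathbf 1_{\hat{\mathcal E}_i \cap \mathcal E_i^c}
 - \sum_{i=1}^n K(V_i/h_n)\, \mathbf 1_{\hat{\mathcal E}_i^c \cap \mathcal E_i}.
\end{aligned}$$

Therefore

$$\left|\hat S_n(x) - S_n(x)\right| \le C \left( \frac{\|\hat\Pi_D - \Pi_D\|_\infty}{h_n} \sum_{i=1}^n \|x - X_i\|\, \mathbf 1_{\mathcal E_i} + \sum_{i=1}^n \left(\mathbf 1_{\hat{\mathcal E}_i \cap \mathcal E_i^c} + \mathbf 1_{\hat{\mathcal E}_i^c \cap \mathcal E_i}\right) \right). \qquad (3.1)$$

Consequently, by Lemma 3.2, the result will be proved if we show that

$$\frac{\|\hat\Pi_D - \Pi_D\|_\infty}{n h_n F(h_n)} \sum_{i=1}^n \|x - X_i\|\, \mathbf 1_{\mathcal E_i} \to 0 \quad\text{a.s.}$$

and

$$\frac{1}{nF(h_n)} \sum_{i=1}^n \left(\mathbf 1_{\hat{\mathcal E}_i \cap \mathcal E_i^c} + \mathbf 1_{\hat{\mathcal E}_i^c \cap \mathcal E_i}\right) \to 0 \quad\text{a.s. as } n \to \infty.$$

The first limit is proved in technical Lemma 5.1 and the second one in technical
Lemma 5.2.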



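We proceed now to prove the second statement of the proposition. We have to
show that

$$\frac{\mathbb E \hat S_n^2(x)}{\mathbb E S_n^2(x)} \to 1 \quad\text{as } n \to \infty.$$

Using the decomposition

$$\frac{\mathbb E U^2}{\mathbb E V^2} = 1 + \frac{\mathbb E[U - V]^2}{\mathbb E V^2} + 2\, \frac{\mathbb E[V(U - V)]}{\mathbb E V^2}$$

and the bound

$$\frac{\left|\mathbb E[V(U - V)]\right|}{\mathbb E V^2} \le \sqrt{\frac{\mathbb E[U - V]^2}{\mathbb E V^2}},$$

it will be enough to prove that

$$\frac{\mathbb E\left[\hat S_n(x) - S_n(x)\right]^2}{\mathbb E S_n^2(x)} \to 0,$$

which in turn comes down to prove that

$$\mathbb E\left[\frac{\hat S_n(x) - S_n(x)}{nF(h_n)}\right]^2 \to 0,$$

since $\mathbb E S_n^2(x) \sim [M_{D,1}\, nF(h_n)]^2$ by Lemma 3.2.
Starting from inequality (3.1), we obtain

$$\left[\frac{\hat S_n(x) - S_n(x)}{nF(h_n)}\right]^2 \le C \left[\frac{\|\hat\Pi_D - \Pi_D\|_\infty}{n h_n F(h_n)} \sum_{i=1}^n \|x - X_i\|\, \mathbf 1_{\mathcal E_i}\right]^2 + C \left[\frac{1}{nF(h_n)} \sum_{i=1}^n \left(\mathbf 1_{\hat{\mathcal E}_i \cap \mathcal E_i^c} + \mathbf 1_{\hat{\mathcal E}_i^c \cap \mathcal E_i}\right)\right]^2. \qquad (3.2)$$

Consequently, the result will be proved if we show that

$$\mathbb E\left[\frac{\|\hat\Pi_D - \Pi_D\|_\infty}{n h_n F(h_n)} \sum_{i=1}^n \|x - X_i\|\, \mathbf 1_{\mathcal E_i}\right]^2 \to 0$$

and

$$\mathbb E\left[\frac{1}{nF(h_n)} \sum_{i=1}^n \left(\mathbf 1_{\hat{\mathcal E}_i \cap \mathcal E_i^c} + \mathbf 1_{\hat{\mathcal E}_i^c \cap \mathcal E_i}\right)\right]^2 \to 0 \quad\text{as } n \to \infty.$$

The first limit is established in technical Lemma 5.3 and the second one in
technical Lemma 5.4. ■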

The consequences of Proposition 3.1 in terms of kernel regression estimation will
be thoroughly explored in Section 4. However, it already has important
repercussions in density estimation, which are briefly sketched here and may serve as
references for further studies. Suppose that the projected random variable $\Pi_D X$
has a density $f$ with respect to the Lebesgue measure $\lambda$ on $\mathbb R^D$. In this case, the
PCA-kernel density estimate of $f$ (based on the sample $(\hat\Pi_D X_1, \dots, \hat\Pi_D X_n)$)
reads

$$\hat f_n(x) = \frac{\hat S_n(x)}{n h_n^D}$$

and the associated pseudo-estimate (based on $(\Pi_D X_1, \dots, \Pi_D X_n)$) takes the
form

$$f_n(x) = \frac{S_n(x)}{n h_n^D}.$$

An easy adaptation of the proof of Corollary 4.1 in Section 4 shows that, under
the conditions of Proposition 3.1,



E 



/n(x) - /„(x) 



= O 



nhl 



To illustrate the importance of this result, suppose for example that the target
density $f$ belongs to the class $\mathcal G_p$ of $p$-times continuously differentiable functions.
In this context (Stone [551 US]), the optimal rate of convergence over $\mathcal G_p$ is
$n^{-2p/(2p+D)}$ and the kernel density estimate with a bandwidth $h_n^\star \asymp n^{-1/(2p+D)}$
achieves this minimax rate. Thus, letting $\hat f_n^\star$ (respectively $f_n^\star$) be the PCA-kernel
density estimate (respectively pseudo-density estimate) based on this optimal
bandwidth, we are led to

$$\lim_{n \to \infty} \frac{\mathbb E\left[\hat f_n^\star(x) - f_n^\star(x)\right]^2}{n^{-2p/(2p+D)}} = 0$$

as soon as $D > 2$. Thus, the $L^2$-rate of convergence of $\hat f_n^\star$ towards $f_n^\star$ is negligible
with respect to the $L^2$-rate of convergence of $f_n^\star$ towards $f$. In consequence,
replacing $\hat f_n^\star$ by $f_n^\star$ has no effect on the asymptotic rate. The same ideas may
be transposed without further effort to asymptotic normality and other error
criteria.



4 Regression analysis 

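As framed in the introduction, we study in this final section the PCA-kernel
regression procedure, which was our initial motivation. Recall that, in this
context, we observe a set $\{(X_1, Y_1), \dots, (X_n, Y_n)\}$ of independent $\mathcal F \times \mathbb R$-valued
random variables with the same distribution as a generic pair $(X, Y)$, where $X$
is nonatomic and centered, and $Y$ satisfies $\mathbb E|Y| < \infty$. The goal is to estimate the
regression function $r^D(x) = \mathbb E[Y \mid \Pi_D X = \Pi_D x]$ via the PCA-kernel estimate,
which takes the form

$$\hat r_n^D(x) = \frac{\sum_{i=1}^n Y_i\, K\left(\|\hat\Pi_D(x - X_i)\|/h_n\right)}{\sum_{i=1}^n K\left(\|\hat\Pi_D(x - X_i)\|/h_n\right)}.$$

This estimate is mathematically intractable and we plan to prove that we can
substitute for it, without damage, the pseudo-estimate

$$r_n^D(x) = \frac{\sum_{i=1}^n Y_i\, K\left(\|\Pi_D(x - X_i)\|/h_n\right)}{\sum_{i=1}^n K\left(\|\Pi_D(x - X_i)\|/h_n\right)}.$$

To this aim, observe first that, with the notation of Section 3,

$$\hat r_n^D(x) = \frac{\hat Z_n(x)}{\hat S_n(x)} \quad\text{and}\quad r_n^D(x) = \frac{Z_n(x)}{S_n(x)},$$

where, for all $n \ge 1$,

$$\hat Z_n(x) = \sum_{i=1}^n Y_i\, K\left(\frac{\|\hat\Pi_D(x - X_i)\|}{h_n}\right) \quad\text{and}\quad Z_n(x) = \sum_{i=1}^n Y_i\, K\left(\frac{\|\Pi_D(x - X_i)\|}{h_n}\right).$$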



Proposition 4.1 Assume that Assumption Sets K and R are satisfied, that $X$
has bounded support and $Y$ is bounded. Assume also that $r^D(x) > 0$ and $r^D$
is Lipschitz in a neighborhood of $x$. Then, for $\mu$-almost all $x$, if $h_n \downarrow 0$ and
$nF(h_n)/\ln n \to \infty$,

$$\hat Z_n(x) \sim Z_n(x) \quad\text{a.s.}$$

and

$$\mathbb E \hat Z_n^2(x) \sim \mathbb E Z_n^2(x) \quad\text{as } n \to \infty.$$

Proof of Proposition 4.1 Using the Lipschitz property of $r^D$, we easily
obtain by following the lines of Lemma 3.1 that, at $\mu$-almost all $x$,



E 



YK 



(x-X) 



r^{^)F{hr,) 



and 



E 



2 1 iinD(x-x) 



Y^K 



hn 



CF {hn) as n — !- cx). 



Moreover, for /i-almost all x. 



^„(x) -nl 



YK 



\IiD (x - X) 



a.s. 



and 



EZ2(x) 



YK 



\^D (x-X) 

hn 



as n — > oo. 



The first equivalence is a consequence of Bennett's inequality and the fact that,
for all large enough $n$ and $i = 1, 2$,



hn 



> CF (hn) , 



which itself follows from the requirement $r^D(x) > 0$.



Finally, since $Y$ is bounded, an inspection of the proof of Proposition 3.1 reveals
that displays (3.1) and (3.2) may be repeated verbatim with $S$ replaced by $Z$.

Corollary 4.1 Under the assumptions of Proposition 4.1, for $\mu$-almost all $x$,
the estimate $\hat r_n^D$ and the pseudo-estimate $r_n^D$ satisfy

$$\hat r_n^D(x) \sim r_n^D(x) \quad\text{a.s. as } n \to \infty.$$

Moreover,

$$\mathbb E\left[\hat r_n^D(x) - r_n^D(x)\right]^2 = O\left(\frac{\log(n h_n^2)}{n h_n^2}\right).$$



Proof of Corollary 4.1 We start with the decomposition



(x)-r- (x) 



5'„(x) 

which comes down to 

(x)~r^(x) ^ / 
(x) I 



1 - 



1 



5„(x) ] ^ 5„(x) 



5„(x) \ ^„(x) 
5„(x) ] + 5„(x) 



2'„(x) - Z„(x) 



^n(x) 



^„(x)) 



The first part of the corollary is then an immediate consequence of Proposition
3.1 and Proposition 4.1.

We turn to the second part. Note that Z„(x)/S'„(x) is bounded whenever 
Y is bounded. In consequence, we just need to provide upper bounds for 



the terms 



1 - S'„(x)/S'„(x) 



and 



Z„(x) - 2'„(x) /S'„(x) 



Besides, 



classical arguments show that the latter two expectations may be replaced by 

r n 2 r n 2 

E 5„(x) - S'„(x) /ES'2(x) and E Z„(x) - Z„(x) /E52(x), respectively It 

turns out that the analysis of each of these terms is similar, and we will therefore 

focus on the first one only. Given the result of Proposition [3^ this comes down 

r . -|2 2 

to analyse E S'„(x) — 5„(x) / [AIb^iuF {hn)] and to refine the bound. 

By inequality (3.2), we have

2 



E 



5„(x) -S„(x) 



< CE 



C 



nhnF{hn) 

1 " 



(4.1) 



With respect to the first term, Lemma 5.3 asserts that

2 



E 



nhnF{hn) 



i=l 



o 



1 

uKi, 



The second term in inequality (4.1) is of the order $O(\log(n h_n^2)/(n h_n^2))$, as proved
in technical Lemma 5.5. This completes the proof. ■
To illustrate the usefulness of Corollary 4.1, suppose that the regression
function belongs to the class $\mathcal G_p$ of $p$-times continuously differentiable functions.
In this framework, it is well-known (Stone [351 US]) that the optimal rate of
convergence on the class $\mathcal G_p$ is $n^{-2p/(2p+D)}$ and that the kernel estimate with a
bandwidth $h_n^\star \asymp n^{-1/(2p+D)}$ achieves this minimax rate. Plugging this optimal
$h_n^\star$ into the rate of Corollary 4.1, we obtain, up to a logarithmic factor,

$$\mathbb E\left[\hat r_n^D(x) - r_n^D(x)\right]^2 = O(n^{-\alpha}),$$

with $\alpha = (2p + D - 2)/(2p + D)$. This rate is strictly faster than the minimax
rate $n^{-2p/(2p+D)}$ provided $\alpha > 2p/(2p + D)$ or, equivalently, when $D > 2$. In
this case,

$$\lim_{n \to \infty} \frac{\mathbb E\left[\hat r_n^D(x) - r_n^D(x)\right]^2}{n^{-2p/(2p+D)}} = 0,$$

and Corollary 4.1 claims in fact that the rate of convergence of $\hat r_n^D$ towards
$r_n^D$ is negligible with respect to the rate of convergence of $r_n^D$ towards $r^D$. In
conclusion, even if $\hat r_n^D$ is the only possible and feasible estimate, carrying out its
asymptotics from the pseudo-estimate $r_n^D$ is permitted.
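
The comparison of exponents above is elementary arithmetic and can be tabulated exactly. The helper below is a hypothetical illustration: it returns $\alpha = (2p + D - 2)/(2p + D)$ together with the minimax exponent $2p/(2p + D)$ as exact fractions, so that the condition $\alpha > 2p/(2p + D)$ can be read off directly.

```python
from fractions import Fraction

def rate_exponents(p, D):
    """Return (alpha, minimax exponent) for smoothness p and projection dimension D."""
    alpha = Fraction(2 * p + D - 2, 2 * p + D)     # exponent of n * h^2 with h = n^(-1/(2p+D))
    minimax = Fraction(2 * p, 2 * p + D)
    return alpha, minimax

if __name__ == "__main__":
    for p, D in [(1, 1), (1, 2), (2, 3), (2, 5)]:
        alpha, minimax = rate_exponents(p, D)
        print(f"p={p} D={D} alpha={alpha} minimax={minimax} negligible={alpha > minimax}")
```

The two exponents coincide when $D = 2$ and $\alpha$ is strictly larger exactly when $D > 2$, which is the condition stated in the text.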



5 Proofs 

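5.1 Proof of Lemma 2.1

Observe first, since $\hat\lambda_1 \ge \dots \ge \hat\lambda_{D+1}$, that

$$\mathcal A_n = \left\{\hat\lambda_1 < \lambda_1 + 1/2,\ \hat\lambda_D > \lambda_D - \delta_D,\ \text{and } \hat\lambda_{D+1} < \lambda_D - \delta_D\right\}.$$

Therefore

$$\begin{aligned}
\mathbb P(\mathcal A_n^c) &\le \mathbb P\left(\hat\lambda_1 - \lambda_1 \ge 1/2\right) + \mathbb P\left(\hat\lambda_D - \lambda_D \le -\delta_D\right) + \mathbb P\left(\hat\lambda_{D+1} - \lambda_{D+1} \ge \delta_D\right) \\
&\le \mathbb P\left(|\hat\lambda_1 - \lambda_1| \ge 1/2\right) + \mathbb P\left(|\hat\lambda_D - \lambda_D| \ge \delta_D\right) + \mathbb P\left(|\hat\lambda_{D+1} - \lambda_{D+1}| \ge \delta_D\right).
\end{aligned}$$

The inequality

$$\sup_{i \ge 1} |\hat\lambda_i - \lambda_i| \le \|\Gamma_n - \Gamma\|_2$$

shifts the problem from $|\hat\lambda_i - \lambda_i|$ to $\|\Gamma_n - \Gamma\|_2$. An application of a standard
theorem for Hilbert-valued random variables (see for instance Bosq [3]) leads to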



ne 



for three positive constants ci, C2 and C3. Consequently, for fixed D, 

P(^^) = 0(exp(-nC£2)), 

where C is a positive constant depending on D. 
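5.2 Proof of Lemma 3.1

The proof will be based on successive applications of Fubini's theorem.
Denoting by $\mu_{D,x,h_n}$ the probability measure associated with the random variable
$\|\Pi_D(x - X)\|/h_n$, we may write

$$\mathbb E K\left(\frac{\|\Pi_D(x - X)\|}{h_n}\right) = \int_{[0,1]} K(v)\, \mu_{D,x,h_n}(dv).$$

Thus

$$\begin{aligned}
\mathbb E K\left(\frac{\|\Pi_D(x - X)\|}{h_n}\right)
&= \int_{[0,1]} \left[K(1) - \int_v^1 K'(s)\, ds\right] \mu_{D,x,h_n}(dv) \\
&= K(1) F(h_n) - \int_0^1 K'(s) \left[\int \mathbf 1_{[0 \le v \le s]}\, \mu_{D,x,h_n}(dv)\right] ds \\
&= K(1) F(h_n) - \int_0^1 F(h_n s) K'(s)\, ds \\
&= F(h_n) \left[K(1) - \int_0^1 \frac{F(h_n s)}{F(h_n)} K'(s)\, ds\right].
\end{aligned}$$

Using the fact that $F$ is increasing and regularly varying of order $D$, an application
of Lebesgue's dominated convergence theorem yields

$$\mathbb E K\left(\frac{\|\Pi_D(x - X)\|}{h_n}\right) \sim F(h_n) \left[K(1) - \int_0^1 s^D K'(s)\, ds\right],$$

i.e., after an integration by parts,

$$\mathbb E K\left(\frac{\|\Pi_D(x - X)\|}{h_n}\right) \sim F(h_n)\, D \int_0^1 s^{D-1} K(s)\, ds \quad\text{as } n \to \infty.$$

This shows the first statement of the lemma. The proof of the second statement is
similar.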

5.3 Some technical lemmas 

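In this subsection, for all $i = 1, \dots, n$, we let $\hat V_i = \|\hat\Pi_D(x - X_i)\|$ and
$V_i = \|\Pi_D(x - X_i)\|$. The events $\hat{\mathcal E}_i$ and $\mathcal E_i$ are defined by

$$\hat{\mathcal E}_i = \{\hat V_i \le h_n\} \quad\text{and}\quad \mathcal E_i = \{V_i \le h_n\}.$$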



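Lemma 5.1 Assume that $X$ has bounded support. Then, for $\mu$-almost all $x$,

$$\frac{\|\hat\Pi_D - \Pi_D\|_\infty}{n h_n F(h_n)} \sum_{i=1}^n \|x - X_i\|\, \mathbf 1_{\mathcal E_i} = O\left(\sqrt{\frac{\log n}{n h_n^2}}\right) \quad\text{a.s.}$$

Proof of Lemma 5.1 According to statement (iii) of Theorem 2.1,

$$\frac{\|\hat\Pi_D - \Pi_D\|_\infty}{n h_n F(h_n)} = O\left(\sqrt{\frac{\log n}{n^3 h_n^2 F^2(h_n)}}\right) \quad\text{a.s.}$$

Moreover, since $X$ has bounded support, there exists a positive constant $M$ such
that, for $\mu$-almost all $x$,

$$\sum_{i=1}^n \|x - X_i\|\, \mathbf 1_{\mathcal E_i} \le M \sum_{i=1}^n \mathbf 1_{\mathcal E_i} \quad\text{a.s.}$$

Clearly, $\sum_{i=1}^n \mathbf 1_{\mathcal E_i}$ has a Binomial distribution with parameters $n$ and $F(h_n)$
and consequently, by Bennett's inequality,

$$\sum_{i=1}^n \|x - X_i\|\, \mathbf 1_{\mathcal E_i} = O\left(nF(h_n)\right) \quad\text{a.s.}$$

This completes the proof of the lemma. ■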

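Lemma 5.2 Assume that Assumption Set R is satisfied and $X$ has bounded
support. Then, if $h_n \downarrow 0$ and $n h_n^2/\log n \to \infty$,

$$\frac{1}{nF(h_n)} \sum_{i=1}^n \left(\mathbf 1_{\hat{\mathcal E}_i \cap \mathcal E_i^c} + \mathbf 1_{\hat{\mathcal E}_i^c \cap \mathcal E_i}\right) \to 0 \quad\text{a.s. as } n \to \infty.$$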



Proof of Lemma 5.2 Define $\kappa_n = \left(C_\kappa \log n/(n h_n^2)\right)^{1/2}$ and $\eta_n = \kappa_n h_n$, where $C_\kappa$ is a
constant which will be chosen later. Observe that



{hn - Vu <V^< hn} n £^ U {V^ <hn- Vn} D £1 



Similarly 



£t = [V, > hn} = [V^ -V,>hn~ V,} . 

Consequently, we may write 

n n n 



i=l 



j=l 



-ri„<Vi<h„} + X ■'■{yi</i„-?7„,Vi-V'i>/i„-yi} 
i=l 
n 

^ l{?i„-r,„<y,</i„} + Y ^{v,-v.>v^} 

1=1 i=l 
n n 

^ Xll{''"-')"<^><''"} +Xll{|v'.-v,|>7,„}- (5-1) 

i=l i=l 

By Bennett's inequality, we have 

n 

l{;i„-,,„<y,<h„} (hn -rjn <Vi < hn) a.s. as n oo, 



i=l 



whence 



71-t2^1{''^-';-<v;</i„} FTTl a.s. (5.2) 



nF(/i„) ^ 



F{h„ 
But, using the fact that $F$ is regularly varying with index $D$, we may write
F{K - Vn) 



1 



1 — (1 — Kn) ^ DKn — > as n ^> oo. (5-3) 



F [K) 

(Note that the value of the index D influences constants only.) 
Next, since $X$ has bounded support, at $\mu$-almost all $x$,



El 

2 = 1 



E ■'■|||nD-nD|| ||x-Xi||>77„ 



} 



a.s. 



<"l{||n.-n.||^>,„/M} 
for some positive constant $M$. By statement (i) of Theorem 2.1, we have



( 



Therefore, 



whenever 



HD-nc > e) = C'(exp(-Cn£2)) 

oo / 

1 " 

;^El{|y.-v.|>.„}^0 a.s. 

oo 

^ exp {-Cnrjl/M'^) < oo. 



(5.4) 



Observing that $n\eta_n^2 = C_\kappa \log n$, we see that the summability condition above is
fulfilled as soon as $C_\kappa$ is large enough.

Combining inequality (5.1) with (5.2)-(5.3) and (5.4), we conclude that
1 ^ 

— — r > Ip „pc — > a.s. as n — ?► OO. 
nF{hn) ^ 

One shows with similar arguments that 

1 ^ 

— — - — r > Ifcr^e a.s. as n — > oo. 
nF /i„) ^ ^.nf, 

2—1 



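Lemma 5.3 Assume that $X$ has bounded support. Then, for $\mu$-almost all $x$, if
$nF(h_n) \to \infty$,

$$\mathbb E\left[\frac{\|\hat\Pi_D - \Pi_D\|_\infty}{n h_n F(h_n)} \sum_{i=1}^n \|x - X_i\|\, \mathbf 1_{\mathcal E_i}\right]^2 = O\left(\frac{1}{n h_n^2}\right).$$

Proof of Lemma 5.3 For $i = 1, \dots, n$, let $U_i = \|x - X_i\|\, \mathbf 1_{\mathcal E_i} - \mathbb E\left[\|x - X_i\|\, \mathbf 1_{\mathcal E_i}\right]$
and write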



E 



Hd-tid 



< 2E 



1=1 



+ 2n^E^ [||x - X|| IgJ E \\Ud - Hd 



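By Theorem 2.1,

\[
\mathbb{E}\, \|\widehat{\Pi}_D - \Pi_D\|_\infty^2 = O\left( \frac{1}{n} \right)
\]

and, since X has bounded support, for $\mu$-almost all x,

\[
\mathbb{E}^2\left[ \|x - X\|\, \mathbf{1}_{\mathcal{E}} \right] = O\left( F^2(h_n) \right).
\]

Consequently,

\[
n^2\, \mathbb{E}^2\left[ \|x - X\|\, \mathbf{1}_{\mathcal{E}} \right] \mathbb{E}\, \|\widehat{\Pi}_D - \Pi_D\|_\infty^2
= O\left( nF^2(h_n) \right).
\]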



One easily shows, with methods similar to the ones used to prove (ii) of
Theorem 2.1, that

\[
\mathbb{E}\, \|\widehat{\Pi}_D - \Pi_D\|_\infty^4 = O\left( \frac{1}{n^2} \right).
\]



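Moreover, simple computations lead to

\[
\mathbb{E}\left[ \sum_{i=1}^n U_i \right]^4
= O\left( nF(h_n) + n^2F^2(h_n) \right) = O\left( n^2F^2(h_n) \right)
\]

when $nF(h_n) \to \infty$. Consequently, by the Cauchy-Schwarz inequality,

\[
\mathbb{E}\left[ \Big( \sum_{i=1}^n U_i \Big)^2 \|\widehat{\Pi}_D - \Pi_D\|_\infty^2 \right]
\le \left( \mathbb{E}\left[ \sum_{i=1}^n U_i \right]^4 \right)^{1/2}
\left( \mathbb{E}\, \|\widehat{\Pi}_D - \Pi_D\|_\infty^4 \right)^{1/2}
= O\left( F(h_n) \right).
\]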
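Putting all the pieces together, we obtain

\[
\mathbb{E}\left[ \frac{1}{nh_nF(h_n)} \sum_{i=1}^n \|x - X_i\|\, \mathbf{1}_{\mathcal{E}_i}\, \|\widehat{\Pi}_D - \Pi_D\|_\infty \right]^2
= O\left( \frac{1}{nh_n^2} \right) \quad \text{as } n \to \infty.
\]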
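
The final rate in the proof of Lemma 5.3 comes from dividing the two intermediate bounds, of orders $nF^2(h_n)$ and $F(h_n)$, by $(nh_nF(h_n))^2$. The sketch below reproduces this bookkeeping with all O-constants set to 1 and under the illustrative model F(h) = h^D; both simplifications, as well as the function name `lemma53_ratio`, are assumptions made here and are not part of the paper.

```python
def lemma53_ratio(n: int, h: float, D: int) -> float:
    """Ratio of (n*F**2 + F) / (n*h*F)**2 to the target rate 1 / (n*h**2).

    Uses the assumed model F = h**D and unit O-constants.
    Algebraically the ratio equals 1 + 1 / (n * F).
    """
    F = h ** D
    bound = (n * F**2 + F) / (n * h * F) ** 2
    target = 1.0 / (n * h**2)
    return bound / target


if __name__ == "__main__":
    D = 2  # hypothetical reduced dimension
    for n, h in ((10**4, 0.3), (10**6, 0.1), (10**8, 0.05)):
        # n * F(h) grows along this sequence, so the ratio approaches 1.
        print(n, h, lemma53_ratio(n, h, D))
```

The ratio equals $1 + 1/(nF(h_n))$, which explains why the condition $nF(h_n) \to \infty$ is what makes the $O(F(h_n))$ term negligible.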



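Lemma 5.4 Assume that Assumption Set R is satisfied and X has bounded
support. Then, if $h_n \downarrow 0$ and $nh_n^2/\log n \to \infty$,

\[
\mathbb{E}\left[ \frac{1}{nF(h_n)} \sum_{i=1}^n \left( \mathbf{1}_{\widehat{\mathcal{E}}_i} - \mathbf{1}_{\mathcal{E}_i} \right) \right]^2 \to 0
\quad \text{as } n \to \infty.
\]

Proof of Lemma 5.4 The proof is close to the derivation of Lemma 5.2:
almost sure convergence is replaced here by convergence in mean square.
Therefore, we go quickly through it.

Because of $(a + b)^2 \le 2a^2 + 2b^2$, it is enough to prove that

\[
\mathbb{E}\left[ \frac{1}{nF(h_n)} \sum_{i=1}^n \mathbf{1}_{\mathcal{E}_i \cap \widehat{\mathcal{E}}_i^c} \right]^2 \to 0
\quad \text{and} \quad
\mathbb{E}\left[ \frac{1}{nF(h_n)} \sum_{i=1}^n \mathbf{1}_{\mathcal{E}_i^c \cap \widehat{\mathcal{E}}_i} \right]^2 \to 0
\quad \text{as } n \to \infty.
\]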



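We will focus on the first limit only; the proof of the second one is similar. With
the notation of Lemma 5.2, by inequality (5.1),

\[
\mathbb{E}\left[ \frac{1}{nF(h_n)} \sum_{i=1}^n \mathbf{1}_{\mathcal{E}_i \cap \widehat{\mathcal{E}}_i^c} \right]^2
\le 2\, \mathbb{E}\left[ \frac{1}{nF(h_n)} \sum_{i=1}^n \mathbf{1}_{\{h_n - \eta_n < V_i \le h_n\}} \right]^2
+ 2\, \mathbb{E}\left[ \frac{1}{nF(h_n)} \sum_{i=1}^n \mathbf{1}_{\{|\widehat{V}_i - V_i| > \eta_n\}} \right]^2,
\tag{5.5}
\]

where $\eta_n$ is a tuning parameter which will be fixed later.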



The first term on the right of (5.5) is handled exactly as in Lemma 5.2 and
tends to zero. We just require that $\eta_n = \kappa_n h_n$, with $\kappa_n \to 0$.



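With respect to the second term, write

\[
\begin{aligned}
\mathbb{E}\left[ \frac{1}{nF(h_n)} \sum_{i=1}^n \mathbf{1}_{\{|\widehat{V}_i - V_i| > \eta_n\}} \right]^2
&\le \mathbb{E}\left[ \frac{1}{nF(h_n)} \sum_{i=1}^n \mathbf{1}_{\{\|\widehat{\Pi}_D - \Pi_D\|_\infty \|x - X_i\| > \eta_n\}} \right]^2 \\
&\le \mathbb{E}\left[ \frac{1}{nF(h_n)} \sum_{i=1}^n \mathbf{1}_{\{\|\widehat{\Pi}_D - \Pi_D\|_\infty > \eta_n/M\}} \right]^2 \\
&= \frac{1}{F^2(h_n)}\, \mathbb{P}\left( \|\widehat{\Pi}_D - \Pi_D\|_\infty > \eta_n/M \right)
\end{aligned}
\]

at $\mu$-almost all x and for some positive constant M. Applying finally statement
(i) of Theorem 2.1, we obtain

\[
\mathbb{E}\left[ \frac{1}{nF(h_n)} \sum_{i=1}^n \mathbf{1}_{\{|\widehat{V}_i - V_i| > \eta_n\}} \right]^2
= O\left( \frac{\exp\left( -Cn\kappa_n^2 h_n^2/M^2 \right)}{F^2(h_n)} \right),
\]

which tends to zero whenever $\kappa_n = \sqrt{C_\kappa \log n/(nh_n^2)}$ for a sufficiently large $C_\kappa$.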



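Lemma 5.5 Assume that Assumption Set R is satisfied and X has bounded
support. Then, if $nF(h_n) \to \infty$,

\[
\mathbb{E}\left[ \frac{1}{nF(h_n)} \sum_{i=1}^n \left( \mathbf{1}_{\widehat{\mathcal{E}}_i} - \mathbf{1}_{\mathcal{E}_i} \right) \right]^2
= O\left( \frac{\log(nh_n^2)}{nh_n^2} \right).
\]

Proof of Lemma 5.5 We deal only with the term

\[
\mathbb{E}\left[ \frac{1}{nF(h_n)} \sum_{i=1}^n \mathbf{1}_{\mathcal{E}_i \cap \widehat{\mathcal{E}}_i^c} \right]^2,
\]

since the other one may be addressed the same way. At this point, we have to
sharpen the bounds derived in Lemma 5.4. Let $(\kappa_n)_{n \ge 1}$ be a positive
sequence which tends to 0, and recall that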

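\[
\begin{aligned}
\mathcal{E}_i \cap \widehat{\mathcal{E}}_i^c
&\subset \{V_i \le h_n\} \cap \left\{ h_n < V_i + \|(\widehat{\Pi}_D - \Pi_D)(x - X_i)\| \right\} \\
&\subset \{V_i \le h_n\} \cap \left[ \{h_n(1 - \kappa_n) < V_i\} \cup \left\{ \|(\widehat{\Pi}_D - \Pi_D)(x - X_i)\| > \kappa_n h_n \right\} \right] \\
&= \{h_n(1 - \kappa_n) < V_i \le h_n\}
\cup \left( \{V_i \le h_n\} \cap \left\{ \|(\widehat{\Pi}_D - \Pi_D)(x - X_i)\| > \kappa_n h_n \right\} \right).
\end{aligned}
\]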



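Thus

\[
\sum_{i=1}^n \mathbf{1}_{\mathcal{E}_i \cap \widehat{\mathcal{E}}_i^c}
\le \sum_{i=1}^n \mathbf{1}_{\{h_n(1 - \kappa_n) < V_i \le h_n\}}
+ \sum_{i=1}^n \mathbf{1}_{\{\|(\widehat{\Pi}_D - \Pi_D)(x - X_i)\| > \kappa_n h_n\}} \mathbf{1}_{\{V_i \le h_n\}},
\]

and therefore

\[
\left[ \frac{1}{nF(h_n)} \sum_{i=1}^n \mathbf{1}_{\mathcal{E}_i \cap \widehat{\mathcal{E}}_i^c} \right]^2
\le 2 \left[ \frac{1}{nF(h_n)} \sum_{i=1}^n \mathbf{1}_{\{h_n(1 - \kappa_n) < V_i \le h_n\}} \right]^2
+ 2 \left[ \frac{1}{nF(h_n)} \sum_{i=1}^n \mathbf{1}_{\{\|(\widehat{\Pi}_D - \Pi_D)(x - X_i)\| > \kappa_n h_n\}} \mathbf{1}_{\{V_i \le h_n\}} \right]^2.
\tag{5.6}
\]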



Taking expectations and mimicking the method used in the proof of Lemma 
we easily obtain 



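\[
\mathbb{E}\left[ \frac{1}{nF(h_n)} \sum_{i=1}^n \mathbf{1}_{\{h_n(1 - \kappa_n) < V_i \le h_n\}} \right]^2 \le C\kappa_n^2.
\tag{5.7}
\]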



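It remains to bound the last term on the right-hand side of (5.6). To this aim,
using the fact that X has bounded support, we may write, for $\mu$-almost all x,

\[
\begin{aligned}
\mathbf{1}_{\{\|(\widehat{\Pi}_D - \Pi_D)(x - X_i)\| > \kappa_n h_n\}} \mathbf{1}_{\{V_i \le h_n\}}
&\le \mathbf{1}_{\{\|\widehat{\Pi}_D - \Pi_D\|_\infty > \kappa_n h_n/M\}} \mathbf{1}_{\{V_i \le h_n\}}
\quad \text{(for some positive } M\text{)} \\
&\le \mathbf{1}_{A_n^c} \mathbf{1}_{\{V_i \le h_n\}}
+ \mathbf{1}_{\{\|\widehat{\Pi}_D - \Pi_D\|_\infty > \kappa_n h_n/M\} \cap A_n} \mathbf{1}_{\{V_i \le h_n\}},
\end{aligned}
\]

where the set $A_n$ is the same as in Lemma 2.1. Clearly, by the Cauchy-Schwarz
inequality,

\[
\mathbb{E}\left[ \frac{1}{nF(h_n)} \sum_{i=1}^n \mathbf{1}_{A_n^c} \mathbf{1}_{\{V_i \le h_n\}} \right]^2 = O\left( \exp(-Cn) \right),
\tag{5.8}
\]

where the last bound arises from Lemma 2.1. It remains to bound the term

\[
\mathbb{E}\left[ \frac{1}{nF(h_n)} \sum_{i=1}^n \mathbf{1}_{\{\|\widehat{\Pi}_D - \Pi_D\|_\infty > \kappa_n h_n/M\} \cap A_n} \mathbf{1}_{\{V_i \le h_n\}} \right]^2.
\]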



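We have

\[
\mathbb{E}\left[ \frac{1}{nF(h_n)} \sum_{i=1}^n \mathbf{1}_{\{\|\widehat{\Pi}_D - \Pi_D\|_\infty > \kappa_n h_n/M\} \cap A_n} \mathbf{1}_{\{V_i \le h_n\}} \right]^2
= \mathbb{E}\left[ \mathbf{1}_{\{\|\widehat{\Pi}_D - \Pi_D\|_\infty > \kappa_n h_n/M\} \cap A_n}
\left( \frac{1}{nF(h_n)} \sum_{i=1}^n \mathbf{1}_{\{V_i \le h_n\}} \right)^2 \right].
\]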



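Using again the Cauchy-Schwarz inequality and a bound on the fourth moment of
$(1/nF(h_n)) \sum_{i=1}^n \mathbf{1}_{\{V_i \le h_n\}}$, it suffices to bound accurately

\[
\mathbb{P}\left( \left\{ \|\widehat{\Pi}_D - \Pi_D\|_\infty > \kappa_n h_n/M \right\} \cap A_n \right)
\le \mathbb{P}\left( \|\Gamma_n - \Gamma\|_\infty > \kappa_n h_n C/M \right)
= O\left( \exp(-Cn\kappa_n^2 h_n^2) \right),
\tag{5.9}
\]

where we used the bound (2.3) and statement (i) in Theorem 2.1.
Collecting the bounds (5.6)-(5.9), we finally obtain

\[
\mathbb{E}\left[ \frac{1}{nF(h_n)} \sum_{i=1}^n \mathbf{1}_{\mathcal{E}_i \cap \widehat{\mathcal{E}}_i^c} \right]^2
= O(\kappa_n^2) + O\left( \exp(-Cn\kappa_n^2 h_n^2) \right).
\]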



The choice $\kappa_n^{*2} \asymp \exp\left( -Cn\kappa_n^{*2} h_n^2 \right)$, i.e.,

\[
\kappa_n^{*2} = \frac{\log(nh_n^2)}{nh_n^2},
\]

leads to the desired result.
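
The balancing step above can be illustrated numerically. The sketch below writes m = n h_n^2 and makes the generic constant C explicit by taking kappa^2 = log(m) / (C m); this explicit normalisation is an assumption made here for illustration (the paper absorbs C into the O-notation), and the function names are hypothetical. With this choice the exponential term equals exactly 1/m, so both terms of the bound are at most of order log(m)/m.

```python
import math


def balanced_kappa_sq(m: float, C: float) -> float:
    """kappa**2 solving exp(-C * m * kappa**2) == 1 / m, where m = n * h_n**2 > 1."""
    return math.log(m) / (C * m)


def two_terms(m: float, C: float) -> tuple[float, float]:
    """Return (kappa**2, exp(-C * m * kappa**2)), the two terms being balanced."""
    k2 = balanced_kappa_sq(m, C)
    return k2, math.exp(-C * m * k2)


if __name__ == "__main__":
    C = 2.0  # hypothetical value of the generic constant
    for m in (1e2, 1e4, 1e6):
        k2, expo = two_terms(m, C)
        # expo is 1/m; k2 is log(m)/(C*m); both vanish at rate log(m)/m at most.
        print(f"m={m:g}  kappa^2={k2:.3e}  exp term={expo:.3e}  log(m)/m={math.log(m) / m:.3e}")
```

For m large enough that log(m) >= C, the exponential term is dominated by kappa^2, which gives the rate stated in Lemma 5.5 up to a constant factor.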

References 



[1] Bellman, R.E. (1961). Adaptive Control Processes: A Guided Tour, Prince- 
ton University Press, Princeton. 






[2] Bingham, N.H., Goldie, C.M. and Teugels, J.L. (1987). Regular Variation,
Encyclopedia of Mathematics and its Applications, Cambridge University
Press, Cambridge.

[3] Bosq, D. (2000). Linear Processes in Function Spaces, Lecture Notes in 
Statistics 149, Springer-Verlag, New York. 

[4] Cardot, H., Mas, A. and Sarda, P. (2007). CLT in functional linear regression 
models, Probability Theory and Related Fields, 138, 325-361. 

[5] Dauxois, J.-Y., Pousse, A. and Romain, Y. (1982). Asymptotic theory for the 
principal component analysis of a random vector function: Some applications 
to statistical inference, Journal of Multivariate Analysis, 12, 136-154. 

[6] Devroye, L., Gyorfi, L. and Lugosi, G. (1996). A Probabilistic Theory of 
Pattern Recognition, Springer-Verlag, New York. 

[7] Dunford, N. and Schwartz, J.T. (1988). Linear Operators. Part I, II, III,
John Wiley and Sons, Hoboken.

[8] Ferraty, F. and Vieu, P. (2006). Nonparametric Functional Data Analysis:
Theory and Practice, Springer, New York.

[9] Friedman, J.H. (1987). Exploratory projection pursuit. Journal of the Amer- 
ican Statistical Association, 82, 259-266. 

[10] Gohberg, I., Goldberg, S. and Kaashoek, M.A. (1991). Classes of Linear
Operators. Vol. I, II, Birkhäuser, Basel.

[11] Gyorfi, L., Kohler, M., Krzyzak, A. and Walk, H. (2002). A Distribution- 
Free Theory of Nonparametric Regression, Springer-Verlag, New York. 

[12] Jee, J.E. (1987). Exploratory projection pursuit using nonparametric den- 
sity estimation. Proceedings of the Statistical Computing Section of the 
American Statistical Association, 335-339. 

[13] Jolliffe, I.T. (2002). Principal Component Analysis, 2nd Edition, Springer-Verlag,
New York.

[14] Mas, A. and Menneteau, L. (2003). Perturbation approach applied to the
asymptotic study of random operators, Progress in Probability, 55, 127-134.

[15] Nadaraya, E.A. (1964). On estimating regression, Theory of Probability and
its Applications, 9, 141-142.

[16] Nadaraya, E.A. (1970). Remarks on nonparametric estimates for density 
functions and regression curves. Theory of Probability and its Applications, 
15, 134-137. 

[17] Parzen, E. (1962). On the estimation of a probability density function and 
the mode, The Annals of Mathematical Statistics, 33, 1065-1076. 

[18] Prakasa Rao, B.L.S. (1983). Nonparametric Functional Estimation (Probability
and Mathematical Statistics), Academic Press, London.






[19] Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a 
density function, The Annals of Mathematical Statistics, 27, 832-837. 

[20] Rudin, W. (1987). Real and Complex Analysis, McGraw-Hill, New York.

[21] Scott, D.W. (1992). Multivariate Density Estimation: Theory, Practice, 
and Visualization, John Wiley and Sons, Hoboken. 

[22] Scott, D.W. and Wand, M.P. (1991). Feasibility of multivariate density 
estimates, Biometrika, 78, 197-205. 

[23] Silverman, B.W. (1986). Density Estimation for Statistics and Data Anal- 
ysis, Chapman and Hall, London. 

[24] Simonoff, J.S. (1996). Smoothing Methods in Statistics, Springer-Verlag, 
New York. 

[25] Stone, C.J. (1980). Optimal rates of convergence for nonparametric esti- 
mators. The Annals of Statistics, 8, 1348-1360. 

[26] Stone, C.J. (1982). Optimal global rates of convergence for nonparametric 
regression. The Annals of Statistics, 10, 1040-1053. 

[27] Watson, G.S. (1964). Smooth regression analysis, Sankhyā Series A, 26,
359-372.

[28] Wheeden, R.L. and Zygmund, A. (1977). Measure and Integral. An Intro- 
duction to Real Analysis, Marcel Dekker, New York. 






